feat(query): Implement Vector Index with HNSW Algorithm #18134

b41sh · 2025-06-11T03:30:58Z

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

This PR introduces a vector index to Databend, leveraging the Hierarchical Navigable Small World (HNSW) algorithm for efficient similarity search.

Key Features:

Vector Index with HNSW: Implements a vector index based on the HNSW algorithm, enabling fast and accurate approximate nearest neighbor search on VECTOR data. Creating a vector index requires specifying the following parameters to fine-tune performance and accuracy:
- m: Controls the number of connections (edges) per node in the HNSW graph. Higher values generally improve recall but increase index size and construction time.
- ef_construct: Controls the search width during index construction, i.e. the number of candidate neighbors evaluated when each vector is inserted. Higher values improve index quality but increase construction time.
- distance: Specifies the distance function(s) the index supports. Accepted values are cosine, l1, and l2. A single index can support several distance functions, given as a comma-separated list (for example, distance='cosine,l1,l2').
Distance Function Support: Supports three distance functions for similarity calculations:
- cosine_distance: Calculates the cosine distance between vectors, suitable for measuring the angle between vectors and identifying semantic similarity.
- l1_distance: Calculates the L1 distance (Manhattan distance) between vectors.
- l2_distance: Calculates the L2 distance (Euclidean distance) between vectors.
L1 Distance Function Implementation: This PR also adds the l1_distance function, which completes the set of distance functions listed above.
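
For reference, the three distance functions can be sketched in a few lines of Python. This is a minimal illustration that assumes the common textbook definitions (cosine distance taken as 1 minus cosine similarity); it is not the Rust code added by this PR, and the zero-vector handling is an assumption made only for this sketch.

```python
import math
from typing import Sequence


def _check_same_length(a: Sequence[float], b: Sequence[float]) -> None:
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Manhattan distance: sum of absolute coordinate differences."""
    _check_same_length(a, b)
    return sum(abs(x - y) for x, y in zip(a, b))


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance: square root of the summed squared differences."""
    _check_same_length(a, b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity (assumed definition, see the note above)."""
    _check_same_length(a, b)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption for this sketch only: reject zero vectors explicitly.
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)
```

Cosine distance ignores vector length, so parallel vectors have a distance of about 0 and orthogonal vectors have a distance of 1, whereas l1 and l2 both grow with the magnitude of the coordinate differences.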

Implementation Details:

The implementation of the HNSW algorithm is primarily based on modifications to the excellent open-source HNSW implementation from github.com/qdrant/qdrant. We would like to express our sincere gratitude to the Qdrant team for their valuable work, which significantly accelerated the development of this feature.
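
To make the role of the search width concrete, here is a simplified Python sketch of the best-first search that HNSW runs on a single graph layer. It is an illustration under stated assumptions, not the code in this PR or in Qdrant: the names `search_layer`, `vectors`, and `neighbors` are hypothetical, the graph is given as a plain adjacency dict, and squared L2 is used as the distance. During construction the `ef` argument would play the role of ef_construct, and the length of each adjacency list would be bounded by m.

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def l2_squared(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search_layer(
    vectors: Dict[int, Sequence[float]],
    neighbors: Dict[int, List[int]],
    query: Sequence[float],
    entry: int,
    ef: int,
) -> List[Tuple[float, int]]:
    """Best-first search on one layer, keeping at most `ef` closest nodes."""
    d0 = l2_squared(vectors[entry], query)
    visited = {entry}
    candidates = [(d0, entry)]  # min-heap: closest unexplored node first
    results = [(-d0, entry)]    # max-heap (negated): worst kept node on top
    while candidates:
        dist, node = heapq.heappop(candidates)
        if dist > -results[0][0]:
            break  # the closest candidate is worse than every kept result
        for nb in neighbors[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d = l2_squared(vectors[nb], query)
            if len(results) < ef or d < -results[0][0]:
                heapq.heappush(candidates, (d, nb))
                heapq.heappush(results, (-d, nb))
                if len(results) > ef:
                    heapq.heappop(results)  # drop the current worst result
    return sorted((-neg_d, n) for neg_d, n in results)


# Toy graph: ten points on a line, each linked to its direct neighbors.
vectors = {i: [float(i), 0.0] for i in range(10)}
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}
nearest = search_layer(vectors, neighbors, [7.2, 0.0], entry=0, ef=3)
nearest_ids = [node for _, node in nearest]
```

A larger `ef` keeps more candidates alive and explores more of the graph before stopping, which is why higher values tend to trade time for quality.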

Example Usage:

-- Create a table with a vector column and vector index.
CREATE TABLE t(id Int, embedding Vector(128), VECTOR INDEX idx (embedding) m=10 ef_construct=40 distance='cosine,l1,l2') Engine = Fuse;
-- Load the SIFT test dataset (1,000,000 vectors)
COPY INTO t FROM 'fs:///data1/b41sh/sift/sift.csv' FILE_FORMAT = (type = CSV field_delimiter='|');

-- Select the 10 vectors nearest to the query vector
SELECT
    uv.id,
    cosine_distance(uv.embedding, [0.0, 16.0, 35.0, 5.0, 32.0, 31.0, 14.0, 10.0, 11.0, 78.0, 55.0, 10.0, 45.0, 83.0, 11.0, 6.0, 14.0, 57.0, 102.0, 75.0, 20.0, 8.0, 3.0, 5.0, 67.0, 17.0, 19.0, 26.0, 5.0, 0.0, 1.0, 22.0, 60.0, 26.0, 7.0, 1.0, 18.0, 22.0, 84.0, 53.0, 85.0, 119.0, 119.0, 4.0, 24.0, 18.0, 7.0, 7.0, 1.0, 81.0, 106.0, 102.0, 72.0, 30.0, 6.0, 0.0, 9.0, 1.0, 9.0, 119.0, 72.0, 1.0, 4.0, 33.0, 119.0, 29.0, 6.0, 1.0, 0.0, 1.0, 14.0, 52.0, 119.0, 30.0, 3.0, 0.0, 0.0, 55.0, 92.0, 111.0, 2.0, 5.0, 4.0, 9.0, 22.0, 89.0, 96.0, 14.0, 1.0, 0.0, 1.0, 82.0, 59.0, 16.0, 20.0, 5.0, 25.0, 14.0, 11.0, 4.0, 0.0, 0.0, 1.0, 26.0, 47.0, 23.0, 4.0, 0.0, 0.0, 4.0, 38.0, 83.0, 30.0, 14.0, 9.0, 4.0, 9.0, 17.0, 23.0, 41.0, 0.0, 0.0, 2.0, 8.0, 19.0, 25.0, 23.0, 1.0]::vector(128)) AS similarity_score
FROM
    t uv
ORDER BY
    similarity_score ASC
LIMIT 10;

╭────────────────────────────────────╮
│        id       │ similarity_score │
│ Nullable(Int32) │      Float32     │
├─────────────────┼──────────────────┤
│               1 │      0.021507084 │
│               3 │       0.05743867 │
│               7 │       0.07149047 │
│           83607 │       0.10362792 │
│          631204 │       0.11076915 │
│          677835 │       0.11215311 │
│          246711 │       0.11223245 │
│          677794 │       0.11382812 │
│          480593 │       0.11518437 │
│          725638 │      0.116248846 │
╰────────────────────────────────────╯
10 rows read in 0.284 sec. Processed 925.96 thousand rows, 0 B (3.26 million rows/s, 0 B/s) (without cache)
10 rows read in 0.020 sec. Processed 925.96 thousand rows, 0 B (46.3 million rows/s, 0 B/s) (with cache)

-- EXPLAIN shows the vector pruning
EXPLAIN SELECT
    uv.id,
    l1_distance(uv.embedding, [0.0, 16.0, 35.0, 5.0, 32.0, 31.0, 14.0, 10.0, 11.0, 78.0, 55.0, 10.0, 45.0, 83.0, 11.0, 6.0, 14.0, 57.0, 102.0, 75.0, 20.0, 8.0, 3.0, 5.0, 67.0, 17.0, 19.0, 26.0, 5.0, 0.0, 1.0, 22.0, 60.0, 26.0, 7.0, 1.0, 18.0, 22.0, 84.0, 53.0, 85.0, 119.0, 119.0, 4.0, 24.0, 18.0, 7.0, 7.0, 1.0, 81.0, 106.0, 102.0, 72.0, 30.0, 6.0, 0.0, 9.0, 1.0, 9.0, 119.0, 72.0, 1.0, 4.0, 33.0, 119.0, 29.0, 6.0, 1.0, 0.0, 1.0, 14.0, 52.0, 119.0, 30.0, 3.0, 0.0, 0.0, 55.0, 92.0, 111.0, 2.0, 5.0, 4.0, 9.0, 22.0, 89.0, 96.0, 14.0, 1.0, 0.0, 1.0, 82.0, 59.0, 16.0, 20.0, 5.0, 25.0, 14.0, 11.0, 4.0, 0.0, 0.0, 1.0, 26.0, 47.0, 23.0, 4.0, 0.0, 0.0, 4.0, 38.0, 83.0, 30.0, 14.0, 9.0, 4.0, 9.0, 17.0, 23.0, 41.0, 0.0, 0.0, 2.0, 8.0, 19.0, 25.0, 23.0, 1.0]::vector(128)) AS similarity_score
FROM
    t uv
ORDER BY
    similarity_score ASC
LIMIT 10;

-[ EXPLAIN ]-----------------------------------
RowFetch
├── output columns: [uv._vector_score (#2), uv._row_id (#3), uv.id (#0)]
├── columns to fetch: [id]
├── estimated rows: 10.00
└── Limit
    ├── output columns: [uv._vector_score (#2), uv._row_id (#3)]
    ├── limit: 10
    ├── offset: 0
    ├── estimated rows: 10.00
    └── Sort
        ├── output columns: [uv._vector_score (#2), uv._row_id (#3)]
        ├── sort keys: [_vector_score ASC NULLS LAST]
        ├── estimated rows: 8000000.00
        └── TableScan
            ├── table: default.default.t
            ├── output columns: [_vector_score (#2), _row_id (#3)]
            ├── read rows: 976507
            ├── read size: 0
            ├── partitions total: 34
            ├── partitions scanned: 4
            ├── pruning stats: [segments: <range pruning: 1 to 1>, blocks: <range pruning: 34 to 34, vector pruning: 34 to 4>]
            ├── push downs: [filters: [], limit: 10]
            └── estimated rows: 8000000.00
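
The plan above reports `vector pruning: 34 to 4` but does not say how blocks are eliminated. One plausible reading, and it is only an assumption, is that each block's index proposes its best candidates, the candidates are merged into a global top `limit`, and only blocks that contribute to that global result are read. The Python sketch below illustrates that idea; `prune_blocks` and its input layout are hypothetical and do not come from the PR.

```python
import heapq
from typing import Dict, List, Set, Tuple


def prune_blocks(
    per_block_hits: Dict[int, List[Tuple[float, int]]],
    limit: int,
) -> Tuple[List[Tuple[float, int, int]], Set[int]]:
    """Merge per-block candidates and keep only blocks holding a global winner.

    per_block_hits maps a block id to (distance, row_id) pairs proposed by
    that block's index (hypothetical layout, assumed for this sketch).
    """
    merged = (
        (dist, block_id, row_id)
        for block_id, hits in per_block_hits.items()
        for dist, row_id in hits
    )
    top = heapq.nsmallest(limit, merged)
    blocks_to_read = {block_id for _, block_id, _ in top}
    return top, blocks_to_read


hits = {
    0: [(0.1, 5), (0.9, 6)],
    1: [(0.8, 2)],
    2: [(0.2, 1), (0.3, 4)],
}
top, blocks_to_read = prune_blocks(hits, limit=3)
```

Under this reading a query with `limit: 10` can never keep more than 10 blocks, which is at least consistent with 34 blocks shrinking to 4 in the plan.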

-- Drop the vector index
DROP VECTOR INDEX idx ON t;

-- Select the 10 vectors nearest to the query vector, this time without the vector index
SELECT
    uv.id,
    cosine_distance(uv.embedding, [0.0, 16.0, 35.0, 5.0, 32.0, 31.0, 14.0, 10.0, 11.0, 78.0, 55.0, 10.0, 45.0, 83.0, 11.0, 6.0, 14.0, 57.0, 102.0, 75.0, 20.0, 8.0, 3.0, 5.0, 67.0, 17.0, 19.0, 26.0, 5.0, 0.0, 1.0, 22.0, 60.0, 26.0, 7.0, 1.0, 18.0, 22.0, 84.0, 53.0, 85.0, 119.0, 119.0, 4.0, 24.0, 18.0, 7.0, 7.0, 1.0, 81.0, 106.0, 102.0, 72.0, 30.0, 6.0, 0.0, 9.0, 1.0, 9.0, 119.0, 72.0, 1.0, 4.0, 33.0, 119.0, 29.0, 6.0, 1.0, 0.0, 1.0, 14.0, 52.0, 119.0, 30.0, 3.0, 0.0, 0.0, 55.0, 92.0, 111.0, 2.0, 5.0, 4.0, 9.0, 22.0, 89.0, 96.0, 14.0, 1.0, 0.0, 1.0, 82.0, 59.0, 16.0, 20.0, 5.0, 25.0, 14.0, 11.0, 4.0, 0.0, 0.0, 1.0, 26.0, 47.0, 23.0, 4.0, 0.0, 0.0, 4.0, 38.0, 83.0, 30.0, 14.0, 9.0, 4.0, 9.0, 17.0, 23.0, 41.0, 0.0, 0.0, 2.0, 8.0, 19.0, 25.0, 23.0, 1.0]::vector(128)) AS similarity_score
FROM
    t uv
ORDER BY
    similarity_score ASC
LIMIT 10;

╭─────────────────────────────────────╮
│        id       │  similarity_score │
│ Nullable(Int32) │      Float32      │
├─────────────────┼───────────────────┤
│               1 │ -0.00000011920929 │
│               3 │       0.036221445 │
│               7 │       0.047054827 │
│           83607 │        0.08021164 │
│          631204 │        0.08787441 │
│          677835 │        0.08972484 │
│          246711 │       0.090816796 │
│          677794 │       0.091252804 │
│          480593 │         0.0922364 │
│           10337 │         0.0925557 │
╰─────────────────────────────────────╯
10 rows read in 0.638 sec. Processed 1 million rows, 492.33 MiB (1.64 million rows/s, 809.76 MiB/s) (without cache)
10 rows read in 0.580 sec. Processed 1 million rows, 492.33 MiB (1.64 million rows/s, 807.11 MiB/s) (with cache)
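
The two result sets above can be compared directly. The short Python calculation below uses only the ids and timings printed in this description. Treating the query without the index as the exact answer is an assumption (a full scan should be exact, but the description does not say so explicitly), and a single query is far too small a sample to serve as a benchmark.

```python
from typing import Sequence

# Ids returned with the vector index and after dropping it (copied from above).
with_index = [1, 3, 7, 83607, 631204, 677835, 246711, 677794, 480593, 725638]
without_index = [1, 3, 7, 83607, 631204, 677835, 246711, 677794, 480593, 10337]


def recall_at_k(approx_ids: Sequence[int], exact_ids: Sequence[int]) -> float:
    """Fraction of the exact top-k that the approximate search also returned."""
    exact = set(exact_ids)
    return sum(1 for i in approx_ids if i in exact) / len(exact)


recall = recall_at_k(with_index, without_index)

# Wall-clock times in seconds, copied from the query footers above.
speedup_cold = 0.638 / 0.284  # without cache
speedup_warm = 0.580 / 0.020  # with cache

print(f"recall@10 = {recall:.2f}")
print(f"speedup without cache = {speedup_cold:.2f}x")
print(f"speedup with cache = {speedup_warm:.1f}x")
```

For this one query, nine of the ten ids match (id 725638 replaces id 10337), and the indexed query is roughly 2.2 times faster without cache and 29 times faster with cache.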

part of: #17972

Tests

Unit Test
Logic Test
Benchmark Test
No Test - Explain why

Type of change

Bug Fix (non-breaking change which fixes an issue)
New Feature (non-breaking change which adds functionality)
Breaking Change (fix or feature that could cause existing functionality not to work as expected)
Documentation Update
Refactoring
Performance Improvement
Other (please describe):

github-actions · 2025-07-10T08:46:20Z

Docker Image for PR

tag: pr-18134-8bdc45c-1752137107

note: this image tag is only available for internal use.

github-actions · 2025-07-10T14:17:53Z

Docker Image for PR

tag: pr-18134-5437daf-1752156992

note: this image tag is only available for internal use.

github-actions · 2025-07-13T07:36:35Z

Docker Image for PR

tag: pr-18134-b4885b7-1752392120

note: this image tag is only available for internal use.

BohuTANG · 2025-07-13T14:39:28Z

From my test, almost works 👍

github-actions bot added the pr-feature (this PR introduces a new feature to the codebase) label Jun 11, 2025

b41sh force-pushed the feat-vector-hnsw branch 2 times, most recently from 01352f1 to ec5e851 Compare June 19, 2025 05:36

b41sh force-pushed the feat-vector-hnsw branch from ec5e851 to 0b36252 Compare July 10, 2025 03:56

b41sh marked this pull request as ready for review July 10, 2025 06:28

b41sh requested review from sundy-li and BohuTANG July 10, 2025 06:28

BohuTANG added the ci-cloud (Build docker image for cloud test) label Jul 10, 2025

b41sh force-pushed the feat-vector-hnsw branch from 20e788e to 0dc2c04 Compare July 10, 2025 11:25

BohuTANG added ci-cloud (Build docker image for cloud test) and removed ci-cloud (Build docker image for cloud test) labels Jul 10, 2025

b41sh added 5 commits July 13, 2025 14:16

feat(query): Implement Vector Index with HNSW Algorithm

96d16a0

support explain display vetor pruning, add write logs

779dfdf

fuse_block add vector_index_size

eeeda23

multi thread pruning

16a7800

Merge branch 'main' into feat-vector-hnsw

322727c

b41sh force-pushed the feat-vector-hnsw branch from 5b2cc3f to 322727c Compare July 13, 2025 06:17

BohuTANG added ci-cloud (Build docker image for cloud test) and removed ci-cloud (Build docker image for cloud test) labels Jul 13, 2025
